
perf: speed up MemoryStream IPC stream reads #340

Merged
CurtHagenlocher merged 6 commits into apache:main from InCerryGit:perf/stream-reader-managed-memory on Apr 28, 2026



@InCerryGit
Contributor

Summary

This improves ArrowStreamReader when reading from MemoryStream instances that expose their underlying buffer. The reader now uses the exposed buffer for IPC message/schema metadata reads while preserving the existing reader-owned body-buffer boundary.

The change is intentionally scoped to MemoryStream-backed IPC stream reads:

  • public/exposed MemoryStream can use the fast path
  • non-public MemoryStream and partial-read streams continue through the fallback stream-read path
  • record batch body data is still copied into allocator-owned memory before array construction
  • ArrowMemoryReader exact-length continuation-token handling is corrected for complete in-memory buffers
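The split between exposed and non-public MemoryStream instances in the list above can be illustrated with a minimal sketch. It is not the code from this PR: the class and method names are hypothetical, and it only assumes the standard `MemoryStream.TryGetBuffer` API, which returns false when the stream was constructed without a publicly visible buffer.

```csharp
using System;
using System.IO;

// Hypothetical helper, not taken from the PR: shows one way a reader could
// choose between a buffer-backed fast path and the stream-read fallback.
public static class MemoryStreamBufferProbe
{
    // Returns true and the unread part of the underlying buffer when the
    // stream is a MemoryStream that exposes its buffer; otherwise returns
    // false so the caller can fall back to ordinary Stream.Read calls.
    public static bool TryGetRemainingBuffer(Stream stream, out ArraySegment<byte> remaining)
    {
        remaining = default;

        if (stream is MemoryStream ms && ms.TryGetBuffer(out ArraySegment<byte> whole))
        {
            // Position is relative to the start of the exposed segment.
            // Clamp it, because Position may legally be set past Length.
            int consumed = (int)Math.Min(ms.Position, (long)whole.Count);
            remaining = new ArraySegment<byte>(
                whole.Array, whole.Offset + consumed, whole.Count - consumed);
            return true;
        }

        return false;
    }
}
```

A stream created with `new MemoryStream(bytes)` does not expose its buffer, so the probe returns false and the fallback path applies. A stream created with `new MemoryStream()` or with `publiclyVisible: true` returns true.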

Benchmark

BenchmarkDotNet ShortRun, ArrowReaderBenchmark:

| Scenario | Before | After |
| --- | --- | --- |
| ArrowReaderWithMemoryStream_ManagedMemory, 100000 rows / 1 column | 21629.3 us | 7707.6 us |
| ArrowReaderWithMemoryStream_ManagedMemory, 100000 rows / 5 columns | 91112.3 us | 40137.5 us |

Validation

  • dotnet test test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj -c Release --filter "FullyQualifiedName~Apache.Arrow.Tests.ArrowStreamReaderTests"
  • dotnet test test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj -c Release --filter "FullyQualifiedName~Apache.Arrow.Compression.Tests.ArrowStreamReaderTests"
  • dotnet build Apache.Arrow.sln -c Release

Use exposed MemoryStream buffers for stream reader metadata reads while preserving reader-owned record batch body buffers. Keep exact-length memory-reader continuation handling and add focused stream reader coverage for exact slices, fallback streams, allocator behavior, cancellation, and aliasing.
Cover exposed, non-public, managed-allocator, and explicit default-allocator MemoryStream reader scenarios across row and column counts.

BenchmarkDotNet ShortRun (ArrowReaderBenchmark): 100000 rows / 1 col improved 21629.3 us to 7707.6 us with managed memory; 100000 rows / 5 cols improved 91112.3 us to 40137.5 us.
Avoid Stream.ReadExactly in compression reader tests because Windows CI builds net462 and net472, where that API is unavailable.
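For context on the commit above: `Stream.ReadExactly` was added in .NET 7, so it does not exist on .NET Framework targets. A portable replacement can be sketched with the long-standing `Stream.Read(byte[], int, int)` overload. The helper below is a hypothetical illustration, not the code used in the PR's tests.

```csharp
using System;
using System.IO;

// Hypothetical helper, not taken from the PR: a read-until-full loop that
// compiles on net462/net472 as well as on modern .NET.
public static class StreamReadHelpers
{
    // Fills buffer[offset .. offset + count) or throws EndOfStreamException
    // if the stream ends first. Stream.Read may return fewer bytes than
    // requested, so the loop keeps reading until the range is full.
    public static void ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, offset + total, count - total);
            if (read == 0)
            {
                throw new EndOfStreamException(
                    "Stream ended after " + total + " of " + count + " bytes.");
            }
            total += read;
        }
    }
}
```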
Contributor

@CurtHagenlocher left a comment


It feels like there are now effectively two different implementations of ArrowMemoryReaderImplementation: one for MemoryStreams and one for every other kind of stream. Given that this is an internal class, would it make more sense to have a separate class ArrowMemoryStreamReaderImplementation : ArrowMemoryReaderImplementation that handles the MemoryStream-specific flavor?

Comment thread test/Apache.Arrow.Benchmarks/ArrowReaderBenchmark.cs Outdated
Comment thread src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs Outdated
@InCerryGit
Contributor Author

> It feels like there are now effectively two different implementations of ArrowMemoryReaderImplementation: one for MemoryStreams and one for every other kind of stream. Given that this is an internal class, would it make more sense to have a separate class ArrowMemoryStreamReaderImplementation : ArrowMemoryReaderImplementation that handles the MemoryStream-specific flavor?

That makes sense. The current change does make ArrowStreamReaderImplementation carry both the general stream path and the exposed MemoryStream path. I’ll move the MemoryStream-specific logic into a separate internal implementation while preserving the existing stream-reader ownership semantics for record batch bodies.

Move the exposed MemoryStream-specific stream reader path into a dedicated internal implementation while keeping stream-reader body ownership semantics.
Contributor

@CurtHagenlocher left a comment


Thanks again! My only real concern now is that some tests appear to have been inadvertently deleted from ArrowStreamReaderTests. These should be put back.

Comment thread test/Apache.Arrow.Tests/ArrowStreamReaderTests.cs
Comment thread src/Apache.Arrow/Ipc/ArrowMemoryStreamReaderImplementation.cs
Comment thread src/Apache.Arrow/Ipc/ArrowMemoryStreamReaderImplementation.cs Outdated
Comment thread src/Apache.Arrow/Ipc/ArrowStreamReaderImplementation.cs
@CurtHagenlocher merged commit 3434344 into apache:main on Apr 28, 2026
14 checks passed